GLADE-ML: A Database For Big Data Analytics
Abstract
Big Data analytics has been a hot topic in computing systems, and various systems have emerged to better support it. Although databases have been the data hub for decades, they fall short for Big Data analytics due to inherent limitations. This dissertation presents GLADE-ML, a scalable and efficient parallel database specifically tailored for Big Data analytics. Unlike traditional databases, GLADE-ML provides iteration management and explicit or implicit randomization in the execution strategy. Its in-database analytics outperform other in-database analytics solutions by several orders of magnitude. GLADE-ML also introduces the dot-product join operator, which is designed specifically for Big Models.
Big Data analytics has been approached exclusively from a data-parallel perspective, where data are partitioned across multiple worker threads or separate servers and model training is executed concurrently over the different partitions, under various synchronization schemes that guarantee speedup and/or convergence. The dual problem, Big Model, which surprisingly has received no attention in database analytics, is how to manage models with millions if not billions of parameters that do not fit in memory. This distinction in model representation fundamentally changes how in-database analytics tasks are carried out. GLADE-ML supports model parallelism over massive models that cannot fit in memory. It extends the lock-free HOGWILD! family of algorithms to disk-resident models by vertically partitioning the model offline and asynchronously updating the resulting partitions online. Unlike in HOGWILD!, concurrent requests to the common model are minimized by a preemptive push-based sharing mechanism that reduces both the number of disk accesses and the cache coherency messages between workers.
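As a rough illustration of why partitioning matters for a model that does not fit in memory, the sketch below computes the dot product between one sparse example and a model split into blocks. It is not GLADE-ML's operator: the contiguous-range partitioning, the in-memory dictionary standing in for disk-resident partitions, and the names `partition_model` and `dot_product_join` are all assumptions made for this example.

```python
from collections import defaultdict


def partition_model(model, partition_size):
    """Split a dense model vector into contiguous blocks keyed by partition id.

    Assumption: contiguous index ranges; the source does not say how
    GLADE-ML chooses its vertical partitions.
    """
    return {
        pid: model[start:start + partition_size]
        for pid, start in enumerate(range(0, len(model), partition_size))
    }


def dot_product_join(example, partitions, partition_size):
    """Dot product of one sparse example ({index: value}) with a partitioned model.

    Returns the dot product and the number of partitions that were touched.
    """
    # Group the example's non-zero entries by the partition that holds them,
    # so each partition is fetched at most once per example.
    by_partition = defaultdict(list)
    for index, value in example.items():
        by_partition[index // partition_size].append((index % partition_size, value))

    total = 0.0
    for pid, entries in by_partition.items():
        block = partitions[pid]  # stands in for one read of a disk-resident partition
        for offset, value in entries:
            total += block[offset] * value
    return total, len(by_partition)


model = [0.5, -1.0, 2.0, 0.0, 1.5, 3.0, -2.0, 1.0]
partitions = partition_model(model, partition_size=4)
example = {1: 2.0, 2: 1.0, 6: 0.5}
result, touched = dot_product_join(example, partitions, partition_size=4)
print(result, touched)
```

Grouping the non-zero entries by partition before touching the model is the point of the sketch: the number of partition fetches depends on how many partitions an example spans, not on how many non-zero features it has.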
Extensive experimental results for three widespread analytics tasks on real and synthetic datasets show that the proposed framework achieves convergence similar to HOGWILD!, while being the only scalable solution for disk-resident models.
Another distinct feature of GLADE-ML is hyper-parameter tuning. Identifying the optimal hyper-parameters is a time-consuming process that has to be executed from scratch for every dataset/model combination, even by experienced data scientists. GLADE-ML provides speculative parameter testing, which applies advanced parallel multi-query processing methods to evaluate several configurations concurrently. The number of configurations is determined adaptively at runtime, while the configurations themselves are drawn from a distribution that is continuously learned following a Bayesian process. Online aggregation is applied to identify sub-optimal configurations early in the processing by incrementally sampling the training dataset and estimating the objective function corresponding to each configuration.
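The early-pruning idea can be sketched as follows, under simplifying assumptions that are not taken from the dissertation: each configuration is represented by a single candidate weight `w` for the model `y = w * x`, the objective is mean squared loss, the data are already in random order, and a candidate is dropped once the lower end of its normal-approximation confidence interval exceeds the smallest upper end among all candidates. The function names and the synthetic data are hypothetical.

```python
import math
import random


def confidence_interval(total, total_sq, n, z=1.96):
    """Normal-approximation interval for a mean, from running sums over n samples."""
    mean = total / n
    variance = max(total_sq / n - mean * mean, 0.0)
    half_width = z * math.sqrt(variance / n)
    return mean - half_width, mean + half_width


def prune_by_online_aggregation(candidates, data, chunk_size=50):
    """Estimate each candidate's loss on growing samples and drop clear losers early.

    candidates: {name: weight w}; data: list of (x, y) pairs in random order.
    Returns the surviving candidate names and the number of examples read.
    """
    stats = {name: [0.0, 0.0] for name in candidates}  # [sum of loss, sum of loss^2]
    seen = 0
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        seen += len(chunk)
        # All surviving candidates share the same pass over the chunk.
        for name in stats:
            w = candidates[name]
            for x, y in chunk:
                loss = (w * x - y) ** 2
                stats[name][0] += loss
                stats[name][1] += loss * loss
        bounds = {name: confidence_interval(s[0], s[1], seen) for name, s in stats.items()}
        best_upper = min(upper for _, upper in bounds.values())
        # Keep only candidates whose interval still overlaps the best one.
        stats = {name: s for name, s in stats.items() if bounds[name][0] <= best_upper}
        if len(stats) == 1:
            break
    return sorted(stats), seen


rng = random.Random(0)
data = []
for _ in range(2000):
    x = rng.uniform(1.0, 2.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 0.1)))  # synthetic data, true weight 2.0

candidates = {"w=0.5": 0.5, "w=1.9": 1.9, "w=2.0": 2.0, "w=4.0": 4.0}
survivors, seen = prune_by_online_aggregation(candidates, data)
print(survivors, seen)
```

The candidate with the smallest estimated loss can never be pruned in a given round, so at least one configuration always survives. Poor configurations are discarded after a small sample, while close competitors need more data before they can be told apart.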
Related resources
Scalable Analytics Model Calibration with Online Aggregation
Model calibration is a major challenge faced by the plethora of statistical analytics packages that are increasingly used in Big Data applications. Identifying the optimal model parameters is a time-consuming process that has to be executed from scratch for every dataset/model combination even by experienced data scientists. We argue that the lack of support to quickly identify sub-optimal conf...
Speculative Approximations for Terascale Analytics
Model calibration is a major challenge faced by the plethora of statistical analytics packages that are increasingly used in Big Data applications. Identifying the optimal model parameters is a time-consuming process that has to be executed from scratch for every dataset/model combination even by experienced data scientists. We argue that the incapacity to evaluate multiple parameter configurat...
Big Data Analytics and Now-casting: A Comprehensive Model for Eventuality of Forecasting and Predictive Policies of Policy-making Institutions
The ability of now-casting and eventuality is the most crucial and vital achievement of big data analytics in the area of policy-making. To recognize the trends and to render a real image of the current condition and alarming immediate indicators, the significance and the specific positions of big data in policy-making are undeniable. Moreover, the requirement for policy-making institutions to ...
A Fuzzy TOPSIS Approach for Big Data Analytics Platform Selection
Big data sizes are constantly increasing. Big data analytics is where advanced analytic techniques are applied on big data sets. Analytics based on large data samples reveals and leverages business change. The popularity of big data analytics platforms, which are often available as open-source, has not remained unnoticed by big companies. Google uses MapReduce for PageRank and inverted indexes....
Application of Big Data Analytics in Power Distribution Network
Smart grid enhances optimization in generation, distribution and consumption of the electricity by integrating information and communication technologies into the grid. Today, utilities are moving towards smart grid applications, most common one being deployment of smart meters in advanced metering infrastructure, and the first technical challenge they face is the huge volume of data generated ...
Publication date: 2016